
    'What are you referring to?' Evaluating the Ability of Multi-Modal Dialogue Models to Process Clarificational Exchanges

    Referential ambiguities arise in dialogue when a referring expression does not uniquely identify the intended referent for the addressee. Addressees usually detect such ambiguities immediately and work with the speaker to repair them using meta-communicative Clarificational Exchanges (CEs): a Clarification Request (CR) followed by a response. Here, we argue that the ability to generate and respond to CRs imposes specific constraints on the architecture and objective functions of multi-modal, visually grounded dialogue models. We use the SIMMC 2.0 dataset to evaluate the ability of different state-of-the-art model architectures to process CEs, with a metric that probes the contextual updates they trigger in the model. We find that language-based models are able to encode simple multi-modal semantic information and process some CEs, excelling at those related to the dialogue history, whereas multi-modal models can use additional learning objectives to obtain disentangled object representations, which become crucial for handling complex referential ambiguities across modalities.
    Comment: Accepted at SIGDIAL'23 (upcoming). Repository with code and experiments available at https://github.com/JChiyah/what-are-you-referring-t
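    The probing metric itself is not detailed in this abstract. Purely as an illustration, the sketch below compares a model's dialogue-context encoding before and after a clarificational exchange is appended, using cosine distance as a stand-in for the contextual-update probe; the encoder and helper names are hypothetical, not the paper's implementation.

        # Hypothetical sketch: measure how much a context encoding changes once a
        # Clarificational Exchange (CR + response) is appended to the history.
        # `encode` stands in for any dialogue model's context encoder.
        import numpy as np

        def contextual_update(encode, history, clarification_request, response):
            """Cosine distance between encodings with and without the CE."""
            before = encode(history)
            after = encode(history + [clarification_request, response])
            cos = np.dot(before, after) / (np.linalg.norm(before) * np.linalg.norm(after))
            return 1.0 - cos  # larger value = bigger contextual update

        # Toy usage with a dummy bag-of-words "encoder".
        def toy_encode(turns):
            vec = np.zeros(64)
            for turn in turns:
                for tok in turn.split():
                    vec[hash(tok) % 64] += 1.0
            return vec

        history = ["USER: I like the red one", "SYSTEM: Which red one do you mean?"]
        print(contextual_update(toy_encode, history,
                                "USER: The one on the left shelf",
                                "SYSTEM: Got it, the red jacket on the left shelf."))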

    Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games

    In visual guessing games, a Guesser has to identify a target object in a scene by asking questions to an Oracle. An effective strategy for the players is to learn conceptual representations of objects that are both discriminative and expressive enough to ask questions and guess correctly. However, as shown by Suglia et al. (2020), existing models fail to learn truly multi-modal representations, relying instead on gold category labels for objects in the scene at both training and inference time. This provides an unnatural performance advantage when categories at inference time match those seen during training, and it causes models to fail in more realistic "zero-shot" scenarios where out-of-domain object categories are involved. To overcome this issue, we introduce a novel "imagination" module based on Regularized Auto-Encoders that learns context-aware and category-aware latent embeddings without relying on category labels at inference time. Our imagination module outperforms state-of-the-art competitors by 8.26% gameplay accuracy in the CompGuessWhat?! zero-shot scenario (Suglia et al., 2020), and it improves Oracle and Guesser accuracy by 2.08% and 12.86% in the GuessWhat?! benchmark when no gold categories are available at inference time. The imagination module also boosts reasoning about object properties and attributes.
    Comment: Accepted to the International Conference on Computational Linguistics (COLING) 2020
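    The abstract does not give implementation details, so the following is only a minimal sketch of what a regularised auto-encoder "imagination" module might look like: visual object features are encoded into a latent embedding and reconstructed, with an L2 penalty on the latent code standing in for the regularisation. All dimensions and names are assumptions, not the paper's design.

        # Hypothetical sketch of an "imagination" auto-encoder over visual features.
        import torch
        import torch.nn as nn

        class ImaginationAE(nn.Module):
            def __init__(self, visual_dim=2048, latent_dim=256):
                super().__init__()
                self.encoder = nn.Sequential(nn.Linear(visual_dim, 512), nn.ReLU(),
                                             nn.Linear(512, latent_dim))
                self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                             nn.Linear(512, visual_dim))

            def forward(self, visual_feats):
                z = self.encoder(visual_feats)   # "imagined" latent object embedding
                return z, self.decoder(z)

        def ae_loss(visual_feats, recon, z, reg_weight=0.01):
            recon_loss = nn.functional.mse_loss(recon, visual_feats)
            return recon_loss + reg_weight * z.pow(2).mean()  # simple latent regulariser

        # Toy usage: a batch of 4 objects with 2048-d visual features.
        feats = torch.randn(4, 2048)
        model = ImaginationAE()
        z, recon = model(feats)
        print(ae_loss(feats, recon, z).item())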

    CompGuessWhat?!: A Multi-task Evaluation Framework for Grounded Language Learning

    Approaches to Grounded Language Learning typically focus on a single task-based final performance measure that may not depend on desirable properties of the learned hidden representations, such as their ability to predict salient attributes or to generalise to unseen situations. To remedy this, we present GROLLA, an evaluation framework for Grounded Language Learning with Attributes comprising three sub-tasks: 1) goal-oriented evaluation; 2) object attribute prediction evaluation; and 3) zero-shot evaluation. We also propose a new dataset, CompGuessWhat?!, as an instance of this framework for evaluating the quality of learned neural representations, in particular with respect to attribute grounding. To this end, we extend the original GuessWhat?! dataset by adding a semantic layer on top of the perceptual one. Specifically, we enrich the VisualGenome scene graphs associated with the GuessWhat?! images with abstract and situated attributes. Using diagnostic classifiers, we show that current models learn representations that are not expressive enough to encode object attributes (average F1 of 44.27). In addition, they learn neither strategies nor representations that are robust enough to perform well when novel scenes or objects are involved in gameplay (zero-shot best accuracy 50.06%).
    Comment: Accepted to the Annual Conference of the Association for Computational Linguistics (ACL) 2020
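    As a rough illustration of the diagnostic-classifier idea (not the paper's exact setup), the sketch below trains a linear probe on frozen object representations to predict a single binary attribute and reports its F1 score; the data here is random and purely illustrative.

        # Hypothetical diagnostic (probing) classifier: a linear probe on frozen
        # object embeddings predicting one attribute, evaluated with F1.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import f1_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        reps = rng.normal(size=(500, 128))       # stand-in for frozen object embeddings
        labels = rng.integers(0, 2, size=500)    # stand-in for one attribute label

        X_tr, X_te, y_tr, y_te = train_test_split(reps, labels, random_state=0)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        print("attribute F1:", f1_score(y_te, probe.predict(X_te)))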

    An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games

    Guessing games are a prototypical instance of the "learning by interacting" paradigm. This work investigates how well an artificial agent can benefit from playing guessing games when later asked to perform novel NLP downstream tasks such as Visual Question Answering (VQA). We propose two ways to exploit playing guessing games: 1) a supervised learning scenario in which the agent learns to mimic successful guessing games, and 2) a novel way for an agent to play by itself, called Self-play via Iterated Experience Learning (SPIEL). We evaluate the ability of both procedures to generalize: an in-domain evaluation shows increased accuracy (+7.79) compared with competitors on the evaluation suite CompGuessWhat?!; a transfer evaluation shows improved performance for VQA on the TDIUC dataset in terms of harmonic average accuracy (+5.31), thanks to the more fine-grained object representations learned via SPIEL.
    Comment: Accepted paper for the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021)
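    As a schematic illustration of an iterated self-play loop of the kind described above (the actual SPIEL procedure, reward signal, and training details may differ), the agent below plays games with its current policy, keeps only the successful episodes, and retrains on the accumulated experience; every component is a toy placeholder.

        # Schematic self-play loop: play, filter successful games, retrain, repeat.
        import random

        def play_game(policy, image_id):
            """Toy stand-in for a full Questioner/Oracle/Guesser episode."""
            episode = {"image": image_id, "turns": ["is it red?", "yes"]}
            episode["success"] = random.random() < policy["skill"]
            return episode

        def retrain(policy, successful_games):
            """Toy stand-in for fine-tuning on successful self-play games."""
            policy["skill"] = min(0.95, policy["skill"] + 0.005 * len(successful_games))
            return policy

        policy = {"skill": 0.3}
        experience = []
        for iteration in range(5):
            episodes = [play_game(policy, img) for img in range(100)]
            successful = [ep for ep in episodes if ep["success"]]
            experience.extend(successful)
            policy = retrain(policy, successful)
            print(f"iter {iteration}: kept {len(successful)} games, skill {policy['skill']:.2f}")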

    Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks

    In this work we propose Ask Me Any Rating (AMAR), a novel content-based recommender system based on deep neural networks that produces top-N recommendations by leveraging user and item embeddings learnt from textual information describing the items. A comprehensive experimental evaluation conducted on state-of-the-art datasets showed a significant improvement over all the baselines taken into account.
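    As a minimal sketch of a content-based neural recommender in this spirit (the architecture, dimensions, and scoring below are assumptions, not AMAR's actual design), a GRU encodes an item's textual description into an item embedding, a user embedding is looked up, and their dot product scores the user-item pair; ranking these scores yields top-N recommendations.

        # Hypothetical content-based recommender: GRU-encoded item text + user embedding.
        import torch
        import torch.nn as nn

        class ContentRecommender(nn.Module):
            def __init__(self, vocab_size=5000, num_users=1000, emb_dim=64):
                super().__init__()
                self.word_emb = nn.Embedding(vocab_size, emb_dim)
                self.gru = nn.GRU(emb_dim, emb_dim, batch_first=True)
                self.user_emb = nn.Embedding(num_users, emb_dim)

            def forward(self, user_ids, item_token_ids):
                _, h = self.gru(self.word_emb(item_token_ids))  # encode item description
                item_vec = h.squeeze(0)
                user_vec = self.user_emb(user_ids)
                return (user_vec * item_vec).sum(dim=-1)        # predicted rating score

        # Toy usage: score three items for one user, then rank them for top-N.
        model = ContentRecommender()
        users = torch.tensor([0, 0, 0])
        items = torch.randint(0, 5000, (3, 20))  # three items, 20 description tokens each
        scores = model(users, items)
        print("top-N order:", scores.argsort(descending=True).tolist())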

    Going for GOAL: A Resource for Grounded Football Commentaries

    Recent video+language datasets cover domains where the interaction is highly structured, such as instructional videos, or where the interaction is scripted, such as TV shows. Both of these properties can lead models to exploit spurious cues rather than learning to ground language. In this paper, we present GrOunded footbAlL commentaries (GOAL), a novel dataset of football (or 'soccer') highlight videos with transcribed live commentaries in English. As the course of a game is unpredictable, so are its commentaries, which makes them a unique resource for investigating dynamic language grounding. We also provide state-of-the-art baselines for the following tasks: frame reordering, moment retrieval, live commentary retrieval, and play-by-play live commentary generation. Results show that SOTA models perform reasonably well in most tasks. We discuss the implications of these results and suggest new tasks for which GOAL can be used. Our codebase is available at: https://gitlab.com/grounded-sport-convai/goal-baselines
    Comment: Preprint formatted using the ACM Multimedia template (8 pages + appendix)
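    The evaluation protocols are not specified in this abstract; as a toy illustration of how a retrieval-style task such as live commentary retrieval might be scored, the sketch below computes Recall@k from a clip-commentary similarity matrix with gold pairs on the diagonal. The data and metric choice are assumptions for illustration only.

        # Toy Recall@k for a retrieval task: is the gold commentary in the top-k candidates?
        import numpy as np

        def recall_at_k(similarity, k=5):
            """similarity[i, j]: score of commentary j for clip i; gold pairs on the diagonal."""
            ranks = np.argsort(-similarity, axis=1)      # best-scoring candidates first
            hits = [i in ranks[i, :k] for i in range(similarity.shape[0])]
            return float(np.mean(hits))

        rng = np.random.default_rng(0)
        sim = rng.normal(size=(100, 100))
        sim[np.arange(100), np.arange(100)] += 2.0       # make gold pairs score higher
        print("Recall@5:", recall_at_k(sim, k=5))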

    Visually grounded representation learning using language games for embodied AI

    The ability to communicate in Natural Language is considered one of the ingredients that facilitated the development of humans' remarkable intelligence. Analogously, developing artificial agents that can seamlessly integrate with humans requires them to understand and use Natural Language just like we do. Humans use Natural Language to coordinate and communicate relevant information to solve their tasks: they play so-called "language games". In this thesis, we explore computational models of how meanings can materialise in situated and embodied language games. Meanings are instantiated when language is used to refer to, and to do things in, the world. In these activities, such as "guessing an object in an image" or "following instructions to complete a task", perceptual experience can be used to derive grounded meaning representations. Considering that different language games favour the development of specific concepts, we argue that it is detrimental to evaluate agents on their ability to solve a single task. To mitigate this problem, we define GROLLA, a multi-task evaluation framework for visual guessing games that extends a goal-oriented evaluation with auxiliary tasks aimed at assessing the quality of the learned representations as well. Using this framework, we demonstrate the inability of recent computational models to learn truly multimodal representations that can generalise to unseen object categories. To overcome this issue, we propose a representation learning component that derives concept representations from perceptual experience, obtaining substantial gains over the baselines, especially when unseen object categories are involved. To demonstrate that guessing games are a generic procedure for grounded language learning, we present SPIEL, a novel self-play procedure for transferring learned representations to novel multimodal tasks. We show that models trained in this way obtain better performance and learn better concept representations than competitors. Thanks to this procedure, artificial agents can learn from interaction using any image-based dataset. Additionally, learning the meaning of concepts involves understanding how entities interact with other entities in the world. For this purpose, we use action-based and event-driven language games to study how an agent can learn visually grounded conceptual representations from dynamic scenes. We design EmBERT, a generic architecture for an embodied agent that learns representations useful for completing language-guided action execution tasks in a 3D environment. Finally, learning visually grounded representations can also be achieved by watching others complete a task. Inspired by this idea, we study how to learn representations from videos that can be used to tackle multimodal tasks such as commentary generation. For this purpose, we define GOAL, a highly multimodal benchmark based on football commentaries that requires models to learn very fine-grained and rich representations to be successful. We conclude with directions for further progress in the computational learning of grounded meaning representations.

    An Analysis of Visually Grounded Instructions in Embodied AI Tasks

    Thanks to Deep Learning models able to learn from Internet-scale corpora, we have observed tremendous advances in both text-only and multi-modal tasks such as question answering and image captioning. However, real-world tasks require agents that are embodied in the environment and can collaborate with humans by following language instructions. In this work, we focus on ALFRED, a large-scale instruction-following dataset proposed to develop artificial agents that can execute both navigation and manipulation actions in 3D simulated environments. We present a new Natural Language Understanding component for Embodied Agents as well as an in-depth error analysis of the model failures for this challenge, going beyond the success-rate performance that has been driving progress on this benchmark. Furthermore, we provide the research community with important directions for future work in this field, which are essential for developing collaborative embodied agents.
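    The abstract does not describe the component's design, so the sketch below is only a toy illustration of the underlying idea: mapping an ALFRED-style instruction to a coarse (action, object) pair. It uses keyword matching with hypothetical action and object inventories, whereas the component presented in the paper is a learned model.

        # Toy illustration: keyword-based mapping of an instruction to (action, object).
        ACTIONS = {"pick up": "PickupObject", "put": "PutObject",
                   "turn on": "ToggleObjectOn", "go to": "GotoLocation",
                   "slice": "SliceObject", "heat": "HeatObject"}
        OBJECTS = ["apple", "knife", "microwave", "counter", "lamp", "mug"]

        def parse_instruction(text):
            text = text.lower()
            action = next((a for phrase, a in ACTIONS.items() if phrase in text), "Unknown")
            obj = next((o for o in OBJECTS if o in text), "Unknown")
            return action, obj

        print(parse_instruction("Pick up the apple on the counter"))  # ('PickupObject', 'apple')
        print(parse_instruction("Turn on the lamp by the bed"))       # ('ToggleObjectOn', 'lamp')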